B-ROC Curves for the Assessment of Classifiers over Imbalanced Data Sets
Abstract
The class imbalance problem appears to be ubiquitous across a large portion of the machine learning and data mining communities. One of the key questions in this setting is how to evaluate learning algorithms in the presence of class imbalance. In this paper we introduce Bayesian Receiver Operating Characteristic (B-ROC) curves, a set of tradeoff curves that combine, in an intuitive way, the variables most relevant to the evaluation of classifiers over imbalanced data sets. This presentation is based on section 4 of (Cárdenas, Baras, & Seamon 2006).

Introduction

The term class imbalance refers to a classification task in which there are many more instances of some classes than of others. The problem is that under this setting, classifiers generally perform poorly because they tend to concentrate on the large classes and disregard those with few examples. Given that this problem is prevalent in a wide range of practical classification problems, there has been recent interest in designing and evaluating classifiers faced with imbalanced data sets (Japkowicz 2000; Chawla, Japkowicz, & Kołcz 2003; Chawla, Japkowicz, & Kołcz 2004). A number of approaches to these issues have been proposed in the literature. Ideas such as data sampling methods, one-class learning (i.e., recognition-based learning), and feature selection algorithms appear to be the most active research directions for learning classifiers. On the other hand, the question of how to evaluate binary classifiers in the case of class imbalance appears to be dominated by the use of ROC curves (Ferri et al. 2004; 2005) and, to a lesser extent, by error curves (Drummond & Holte 2001). The class imbalance problem is of particular importance in intrusion detection systems (IDSs). In this paper we present and expand some of the ideas introduced in our research on the evaluation of IDSs (Cárdenas, Baras, & Seamon 2006).
In particular we claim that for heavily imbalanced data sets, ROC curves cannot provide the necessary intuition for the choice of the operating point of the classifier, and therefore we introduce Bayesian-ROC (B-ROC) curves. Furthermore, we demonstrate how B-ROCs can deal with uncertainty in the class distribution by displaying the performance of the classifier under different conditions. Finally, we also show how B-ROCs can be used for comparing classifiers without any assumptions about misclassification costs.

Copyright © 2006, American Association for Artificial Intelligence (www.aaai.org). All rights reserved.

Performance Tradeoffs

Before we present our formulation we need to introduce some notation and definitions. Assume that the input to the classifier is a feature vector x. Let C be an indicator random variable denoting whether x belongs to class zero, C = 0 (the majority class), or class one, C = 1 (the minority class). The output of the classifier is denoted by A = 1 if the classifier assigns x to class one, and A = 0 if it assigns x to class zero. Finally, the class imbalance problem is quantified by the probability of a positive example, p = Pr[C = 1]. Most classifiers subject to the class imbalance problem are evaluated with the help of ROC curves, a tool to visualize the tradeoff between the probability of false alarm PFA ≡ Pr[A = 1|C = 0] and the probability of detection PD ≡ Pr[A = 1|C = 1]. Of interest to us in the intrusion detection community is that classifiers with ROC curves achieving traditionally "good" operating points such as (PFA = 0.01, PD = 1) would still generate a huge number of false alarms in realistic scenarios. This effect is due in part to the class imbalance problem, since one of the causes of the large number of false alarms that IDSs generate is the enormous difference between the large amount of normal activity and the small number of intrusion events.
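The conditional definitions of PFA and PD above can be illustrated with a minimal sketch: given paired class labels C and classifier decisions A, each rate is just the fraction of alarms within one class. The function name and the toy data are ours, not from the paper.

```python
# Minimal sketch: estimating P_FA = Pr[A=1|C=0] and P_D = Pr[A=1|C=1]
# from class labels (C) and classifier decisions (A). Toy data only.
def false_alarm_and_detection_rates(labels, outputs):
    """labels[i] = C (0 = majority class, 1 = minority class);
    outputs[i] = A (the classifier's decision for example i)."""
    negatives = [a for c, a in zip(labels, outputs) if c == 0]
    positives = [a for c, a in zip(labels, outputs) if c == 1]
    p_fa = sum(negatives) / len(negatives)  # fraction of class-0 flagged
    p_d = sum(positives) / len(positives)   # fraction of class-1 flagged
    return p_fa, p_d

# Imbalanced toy sample: 8 majority examples, 2 minority examples.
labels  = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
outputs = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]
p_fa, p_d = false_alarm_and_detection_rates(labels, outputs)
print(p_fa, p_d)  # 0.125 0.5
```

Sweeping the classifier's decision threshold and re-estimating this pair at each setting is exactly what traces out the ROC curve.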
The reasoning is that because the likelihood of an attack is very small, even if an IDS fires an alarm, the likelihood of an actual intrusion remains relatively small. That is, when we compute the posterior probability of intrusion given that the IDS fired an alarm (a quantity known as the Bayesian detection rate, or the positive predictive value (PPV)), we obtain:

PPV ≡ Pr[C = 1|A = 1] = pPD / (pPD + (1 − p)PFA)    (1)

Therefore, if the rate of incidence of an attack is very small, for example if on average only 1 out of 10^5 events is an attack (p = 10^-5), then even a classifier operating at (PFA = 0.01, PD = 1) achieves a PPV below 0.1%: almost every alarm is a false alarm.
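Equation (1) is a direct application of Bayes' rule, and plugging in the numbers from the text makes the base-rate effect concrete. A short sketch (the function name is ours; the operating point PD = 1, PFA = 0.01 and base rate p = 10^-5 are the values discussed above):

```python
# Positive predictive value (Bayesian detection rate) from equation (1):
# PPV = p*P_D / (p*P_D + (1-p)*P_FA), where p = Pr[C = 1] is the base
# rate of the minority (attack) class.
def ppv(p, p_d, p_fa):
    return (p * p_d) / (p * p_d + (1 - p) * p_fa)

# A traditionally "good" ROC operating point, with attacks making up
# only 1 in 10^5 events:
print(ppv(p=1e-5, p_d=1.0, p_fa=0.01))  # ~0.000999
```

So even with perfect detection and a 1% false alarm rate, fewer than 1 in 1000 alarms corresponds to a real attack, which is why a plain ROC curve can be misleading under heavy imbalance.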
Similar Resources
Iterative Boolean combination of classifiers in the ROC space: An application to anomaly detection with HMMs
Hidden Markov models (HMMs) have been shown to provide a high level of performance for detecting anomalies in sequences of system calls to the operating system kernel. Using Boolean conjunction and disjunction functions to combine the responses of multiple HMMs in the ROC space may significantly improve performance over a ''single best'' HMM. However, these techniques assume that the classifiers a...
Learning When Data Sets are Imbalanced and When Costs are Unequal and Unknown
The problem of learning from imbalanced data sets, while not the same problem as learning when misclassification costs are unequal and unknown, can be handled in a similar manner. That is, in both contexts, we can use techniques from ROC analysis to help with classifier design. We present results from two studies in which we dealt with skewed data sets and unequal, but unknown, costs of error. W...
ROC curves and video analysis optimization in intestinal capsule endoscopy
Wireless capsule endoscopy involves inspection of hours of video material by a highly qualified professional. Time episodes corresponding to intestinal contractions, which are of interest to the physician, constitute about 1% of the video. The problem is to automatically label time episodes containing contractions so that only a fraction of the video needs inspection. As the classes of contracti...
A hybrid approach to learn with imbalanced classes using evolutionary algorithms
There is an increasing interest in the application of evolutionary algorithms to induce classification rules. This hybrid approach can aid in areas where classical methods of rule induction have not been completely successful. One example is the induction of classification rules in imbalanced domains. Imbalanced data occur when some classes heavily outnumber other classes. Frequently, classical Mac...
Evaluating Misclassifications in Imbalanced Data
Evaluating classifier performance with ROC curves is popular in the machine learning community. To date, the only method to assess confidence of ROC curves is to construct ROC bands. In the case of severe class imbalance with few instances of the minority class, ROC bands become unreliable. We propose a generic framework for classifier evaluation to identify a segment of an ROC curve in which m...
Publication year: 2006